IAGLR 2021 H2O Workshop
1 Agenda
- Welcomes and introductions – Timothy Maguire (UMich CIGLR) and Jo-fai Chow (H2O.ai) (5 mins)
- Great Lakes research questions driven by data – Timothy Maguire (15 mins)
- Introduction to the H2O.ai software – Jo-fai Chow (5 mins)
- Live-code example of data manipulation and machine learning analysis – Jo-fai Chow (20 mins)
- Results in context – Timothy Maguire (10 mins)
- Q&A (5 mins)
2 Welcome
3 About Great Lakes Research
4 Introduction to H2O
5 Software and Code
5.1 Code
- setup.R: install the required packages
- tutorial.Rmd: the main RMarkdown file with code
- tutorial.html: this webpage
- GitHub Repo: https://github.com/woobe/IAGLR_2021_H2O_Workshop
5.2 R Packages
- Check out setup.R
- For this tutorial:
  - h2o for automatic and explainable machine learning
- For RMarkdown:
  - knitr for rendering this RMarkdown
  - rmdformats for the readthedown RMarkdown template
  - DT for nice tables
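The package installation in setup.R can be sketched roughly as follows (a minimal sketch — the actual script in the repo may differ):

```r
# Install the packages used in this tutorial (CRAN versions),
# skipping any that are already installed
pkgs <- c("h2o", "knitr", "rmdformats", "DT")
new_pkgs <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(new_pkgs) > 0) install.packages(new_pkgs)
```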
6 H2O Basics
# Let's go
library(h2o)   # for H2O Machine Learning
library(knitr) # for kable() tables used below
6.1 Start a local H2O Cluster (JVM)
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 hours 54 minutes
H2O cluster timezone: Europe/London
H2O data parsing timezone: UTC
H2O cluster version: 3.32.1.2
H2O cluster version age: 17 days
H2O cluster name: H2O_started_from_R_joe_cqu561
H2O cluster total nodes: 1
H2O cluster total memory: 7.80 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 4.0.5 (2021-03-31)
# Optional settings
h2o.no_progress() # disable progress bar for RMarkdown
h2o.removeAll() # Optional: remove anything from previous session
# Enter your lucky seed here ...
n_seed <- 12345
7 Data - Dom Lake Huron Abund
# Import CSV from GitHub
lake_data <- h2o.importFile('https://raw.githubusercontent.com/woobe/IAGLR_2021_H2O_Workshop/main/data/Dom_Lake_Huron_Abund.csv')
# Show first few samples
kable(head(lake_data, 5))
| C1 | Year | Region | Station | Replicate | lat | lon | depth | substrate | Season | Na2O | Magnesium | MagnesiumOxide | AluminumOxide | SiliconDioxide | Quartz | PhosphorusPentoxide | Sulphur | DDT_TOTAL | P.P.TDE | P.P.DDE | P.P.DDE.1 | HeptachlorEpoxide | Potassium | PotassiumOxide | Calcium | CalciumOxide | TitaniumDioxide | Chromium | Manganese | ManganeseOxide | XTR.Iron | Iron | IronOxide | Cobalt | Nickel | Copper | Zinc | Selenium | Strontium | Beryllium | Silver | Cadmium | Tin | TotalCarbon | OrganicCarbon | Carbon.Nitrogen_Ratio | TotalNitrogen | Mercury | Lead | Uranium | Vanadium | Arsenic | Chloride | Fluoride | Sand | Silt | Clay | Mean_grainsize | simpsonD | shannon | Chironomidae | Oligochaeta | Dreisseniidae | Sphaeriidae | Diporeia | DPOL | DBUG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2006 | Saginaw Bay | 10 | 1 | 43.94167 | -83.62383 | 11 | silt | 3 | 1.2 | 6221.4 | 1.7 | 7.5 | 76.4 | 70.4 | 0.1 | 0 | 1.1 | 0.1 | 0.8 | 5.2 | 0.1 | 2527.1 | 2.4 | 1892.5 | 1.3 | 0.4 | 69.0 | 2034.5 | 0.2 | 3.3 | 24354.6 | 2.8 | 11.9 | 49.9 | 28.7 | 67.3 | 10.6 | 59.0 | 0.5 | 0.2 | 1.3 | 5.5 | 0.4 | 1.0 | 8.0 | 0.1 | 154.4 | 40.9 | 0.4 | 38.5 | 1.1 | 23.3 | 402.2 | 52.3 | 14.0 | 33.4 | 4.5 | 0.8645201 | 2.238382 | 0 | 3684.24 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| 2 | 2006 | Saginaw Bay | 10 | 2 | 43.94167 | -83.62383 | 11 | silt | 3 | 1.2 | 6221.4 | 1.7 | 7.5 | 76.4 | 70.4 | 0.1 | 0 | 1.1 | 0.1 | 0.8 | 5.2 | 0.1 | 2527.1 | 2.4 | 1892.5 | 1.3 | 0.4 | 69.0 | 2034.5 | 0.2 | 3.3 | 24354.6 | 2.8 | 11.9 | 49.9 | 28.7 | 67.3 | 10.6 | 59.0 | 0.5 | 0.2 | 1.3 | 5.5 | 0.4 | 1.0 | 8.0 | 0.1 | 154.4 | 40.9 | 0.4 | 38.5 | 1.1 | 23.3 | 402.2 | 52.3 | 14.0 | 33.4 | 4.5 | 0.8128663 | 1.961416 | 0 | 2270.52 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| 3 | 2006 | Saginaw Bay | 10 | 3 | 43.94167 | -83.62383 | 11 | silt | 3 | 1.2 | 6221.4 | 1.7 | 7.5 | 76.4 | 70.4 | 0.1 | 0 | 1.1 | 0.1 | 0.8 | 5.2 | 0.1 | 2527.1 | 2.4 | 1892.5 | 1.3 | 0.4 | 69.0 | 2034.5 | 0.2 | 3.3 | 24354.6 | 2.8 | 11.9 | 49.9 | 28.7 | 67.3 | 10.6 | 59.0 | 0.5 | 0.2 | 1.3 | 5.5 | 0.4 | 1.0 | 8.0 | 0.1 | 154.4 | 40.9 | 0.4 | 38.5 | 1.1 | 23.3 | 402.2 | 52.3 | 14.0 | 33.4 | 4.5 | 0.8504612 | 2.040843 | 0 | 4712.40 | 0.00 | 64.26 | 0 | 0.00 | 0.00 |
| 4 | 2006 | Saginaw Bay | 11 | 1 | 44.02050 | -83.57367 | 9 | silty sand | 3 | 1.1 | 6160.2 | 1.5 | 6.5 | 79.5 | 74.5 | 0.1 | 0 | 1.1 | 0.1 | 0.7 | 5.3 | 0.1 | 2495.2 | 2.2 | 1830.8 | 1.2 | 0.3 | 55.3 | 2076.3 | 0.2 | 2.8 | 24068.9 | 2.3 | 9.8 | 41.0 | 23.7 | 54.8 | 10.5 | 49.4 | 0.4 | 0.2 | 1.1 | 5.4 | 0.4 | 0.8 | 7.1 | 0.1 | 150.2 | 35.6 | 0.4 | 31.3 | 0.9 | 23.2 | 402.3 | 61.0 | 11.3 | 27.4 | 3.9 | 0.7438019 | 1.754836 | 0 | 2034.90 | 0.00 | 0.00 | 0 | 0.00 | 0.00 |
| 5 | 2006 | Saginaw Bay | 11 | 2 | 44.02050 | -83.57367 | 9 | silty sand | 3 | 1.1 | 6160.2 | 1.5 | 6.5 | 79.5 | 74.5 | 0.1 | 0 | 1.1 | 0.1 | 0.7 | 5.3 | 0.1 | 2495.2 | 2.2 | 1830.8 | 1.2 | 0.3 | 55.3 | 2076.3 | 0.2 | 2.8 | 24068.9 | 2.3 | 9.8 | 41.0 | 23.7 | 54.8 | 10.5 | 49.4 | 0.4 | 0.2 | 1.1 | 5.4 | 0.4 | 0.8 | 7.1 | 0.1 | 150.2 | 35.6 | 0.4 | 31.3 | 0.9 | 23.2 | 402.3 | 61.0 | 11.3 | 27.4 | 3.9 | 0.6572315 | 1.341344 | 0 | 3941.28 | 342.72 | 0.00 | 0 | 85.68 | 257.04 |
h2o.describe(lake_data)
  Label Type Missing Zeros PosInf NegInf       Min        Max       Mean
1 C1 int 0 0 0 0 1.0000 885.00000 443.00000
2 Year int 0 0 0 0 2006.0000 2012.00000 2009.00113
3 Region enum 0 96 0 0 0.0000 3.00000 NA
4 Station enum 0 24 0 0 0.0000 122.00000 NA
5 Replicate int 0 1 0 0 0.0000 3.00000 1.99887
6 lat real 0 0 0 0 43.2695 46.23333 44.68487
Sigma Cardinality
1 255.6217909 NA
2 2.3727809 NA
3 NA 4
4 NA 123
5 0.8190319 NA
6 0.8554837 NA
[ reached 'max' / getOption("max.print") -- omitted 62 rows ]
h2o.hist(lake_data$DPOL, breaks = 100)
h2o.hist(lake_data$DBUG, breaks = 100)
7.1 Define Target and Features
# Define targets
target_DPOL <- "DPOL"
target_DBUG <- "DBUG"
# Remove targets, C1, and Dreisseniidae (which is DPOL + DBUG)
features <- setdiff(colnames(lake_data), c(target_DPOL, target_DBUG, "C1", "Dreisseniidae"))
print(features)
 [1] "Year" "Region" "Station"
[4] "Replicate" "lat" "lon"
[7] "depth" "substrate" "Season"
[10] "Na2O" "Magnesium" "MagnesiumOxide"
[13] "AluminumOxide" "SiliconDioxide" "Quartz"
[16] "PhosphorusPentoxide" "Sulphur" "DDT_TOTAL"
[19] "P.P.TDE" "P.P.DDE" "P.P.DDE.1"
[22] "HeptachlorEpoxide" "Potassium" "PotassiumOxide"
[25] "Calcium" "CalciumOxide" "TitaniumDioxide"
[28] "Chromium" "Manganese" "ManganeseOxide"
[31] "XTR.Iron" "Iron" "IronOxide"
[34] "Cobalt" "Nickel" "Copper"
[37] "Zinc" "Selenium" "Strontium"
[40] "Beryllium" "Silver" "Cadmium"
[43] "Tin" "TotalCarbon" "OrganicCarbon"
[46] "Carbon.Nitrogen_Ratio" "TotalNitrogen" "Mercury"
[49] "Lead" "Uranium" "Vanadium"
[52] "Arsenic" "Chloride" "Fluoride"
[55] "Sand" "Silt" "Clay"
[58] "Mean_grainsize" "simpsonD" "shannon"
[61] "Chironomidae" "Oligochaeta" "Sphaeriidae"
[64] "Diporeia"
7.2 Split Data into Train/Test
h_split <- h2o.splitFrame(lake_data, ratios = 0.75, seed = n_seed)
h_train <- h_split[[1]] # 75% for modelling
h_test <- h_split[[2]]  # 25% for evaluation
dim(h_train)
[1] 656  68
dim(h_test)
[1] 229  68
7.3 Cross-Validation
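The models in the next section use k-fold cross-validation via the `nfolds` argument: H2O splits the training frame into k folds, trains k models each holding out one fold, and reports metrics on the combined holdout predictions. As a minimal sketch (reusing the `features`, `target_DPOL`, `h_train`, and `n_seed` objects defined above):

```r
# 5-fold CV happens inside the training call; no manual fold splitting needed
m <- h2o.gbm(x = features, y = target_DPOL,
             training_frame = h_train,
             nfolds = 5, seed = n_seed)
# Cross-validated metrics are stored on the model object
h2o.rmse(m, xval = TRUE) # cross-validated RMSE
```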
8 Worked Example - Target “DPOL”
8.1 Baseline Regression Models
- h2o.glm(): H2O Generalized Linear Model
- h2o.randomForest(): H2O Random Forest Model
- h2o.gbm(): H2O Gradient Boosting Model
- h2o.deeplearning(): H2O Deep Neural Network Model
- h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model (XGBoost) from DMLC
Let’s start with GBM
# Build a default (baseline) GBM
model_gbm_DPOL <- h2o.gbm(x = features, # All features
y = target_DPOL, # Target
training_frame = h_train, # H2O dataframe with training data
nfolds = 5, # Using 5-fold CV
                          seed = n_seed)           # Your lucky seed
# Cross-Validation
model_gbm_DPOL@model$cross_validation_metrics
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 4620.394
RMSE: 67.97348
MAE: 16.30104
RMSLE: NaN
Mean Residual Deviance : 4620.394
# Evaluate performance on test
h2o.performance(model_gbm_DPOL, newdata = h_test)
H2ORegressionMetrics: gbm
MSE: 2613.399
RMSE: 51.12142
MAE: 12.70523
RMSLE: NaN
Mean Residual Deviance : 2613.399
Let’s use RMSE as the metric for comparing models.
Build Other Baseline Models (GLM, DRF, GBM, DNN) - TRY IT YOURSELF!
# Try other H2O models
# model_glm <- h2o.glm(x = features, y = target, ...)
# model_gbm <- h2o.gbm(x = features, y = target, ...)
# model_drf <- h2o.randomForest(x = features, y = target, ...)
# model_dnn <- h2o.deeplearning(x = features, y = target, ...)
# model_xgb <- h2o.xgboost(x = features, y = target, ...)
8.2 Manual Tuning
8.2.1 Check out the hyper-parameters for each algo
?h2o.glm
?h2o.randomForest
?h2o.gbm
?h2o.deeplearning
?h2o.xgboost
8.2.2 Train a GBM model with manual settings
model_gbm_DPOL_m <- h2o.gbm(x = features,
y = target_DPOL,
training_frame = h_train,
nfolds = 5,
seed = n_seed,
# Manual Settings based on experience
learn_rate = 0.1, # use a lower rate (more conservative)
ntrees = 120, # use more trees (due to lower learn_rate)
sample_rate = 0.7, # use random n% of samples for each tree
                            col_sample_rate = 0.7)  # use random n% of features for each tree
8.2.3 Comparison (RMSE: Lower = Better)
# Create a table to compare RMSE from different models
d_eval <- data.frame(model = c("GBM: Gradient Boosting Model (Baseline)",
"GBM: Gradient Boosting Model (Manual Settings)"),
stringsAsFactors = FALSE)
d_eval$RMSE_cv <- NA
d_eval$RMSE_test <- NA
# Store RMSE values
d_eval[1, ]$RMSE_cv <- model_gbm_DPOL@model$cross_validation_metrics@metrics$RMSE
d_eval[2, ]$RMSE_cv <- model_gbm_DPOL_m@model$cross_validation_metrics@metrics$RMSE
d_eval[1, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL, newdata = h_test))
d_eval[2, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL_m, newdata = h_test))
# Show Comparison (RMSE: Lower = Better)
kable(d_eval)
| model | RMSE_cv | RMSE_test |
|---|---|---|
| GBM: Gradient Boosting Model (Baseline) | 67.97348 | 51.12142 |
| GBM: Gradient Boosting Model (Manual Settings) | 67.87922 | 51.06768 |
8.3 H2O AutoML
# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl_DPOL <- h2o.automl(x = features,
                          y = target_DPOL,
                          training_frame = h_train,
                          nfolds = 5,                # 5-fold Cross-Validation
                          max_models = 30,           # Max number of models
                          stopping_metric = "RMSE",  # Metric to optimize
                          exclude_algos = c("DeepLearning", "XGBoost"), # exclude some algos for a quick demo
                          seed = n_seed)
8.3.1 Leaderboard
# Show the table
kable(as.data.frame(automl_DPOL@leaderboard))
| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| GBM_grid__1_AutoML_20210517_094305_model_7 | 4022.796 | 63.42551 | 4022.796 | 15.04154 | NA |
| GBM_grid__1_AutoML_20210517_094305_model_3 | 4057.498 | 63.69849 | 4057.498 | 15.14776 | NA |
| GBM_grid__1_AutoML_20210517_094305_model_2 | 4083.103 | 63.89917 | 4083.103 | 15.55026 | NA |
| GBM_3_AutoML_20210517_094305 | 4091.409 | 63.96412 | 4091.409 | 15.42055 | NA |
| GBM_4_AutoML_20210517_094305 | 4112.423 | 64.12818 | 4112.423 | 15.27154 | NA |
| GBM_grid__1_AutoML_20210517_094305_model_5 | 4171.660 | 64.58839 | 4171.660 | 17.13950 | NA |
| GBM_2_AutoML_20210517_094305 | 4268.900 | 65.33682 | 4268.900 | 15.65400 | NA |
| StackedEnsemble_AllModels_AutoML_20210517_094305 | 4329.084 | 65.79578 | 4329.084 | 15.60418 | NA |
| StackedEnsemble_BestOfFamily_AutoML_20210517_094305 | 4343.234 | 65.90322 | 4343.234 | 15.36855 | NA |
| GBM_grid__1_AutoML_20210517_094305_model_6 | 4359.818 | 66.02892 | 4359.818 | 16.03464 | NA |
| GBM_grid__1_AutoML_20210517_094305_model_1 | 4550.951 | 67.46073 | 4550.951 | 16.47427 | NA |
| XRT_1_AutoML_20210517_094305 | 4579.862 | 67.67468 | 4579.862 | 16.69115 | 1.591159 |
| DRF_1_AutoML_20210517_094305 | 4878.819 | 69.84854 | 4878.819 | 14.79644 | 1.146071 |
| GBM_1_AutoML_20210517_094305 | 5043.679 | 71.01887 | 5043.679 | 14.24917 | NA |
| GBM_5_AutoML_20210517_094305 | 5173.990 | 71.93045 | 5173.990 | 19.08660 | NA |
| GBM_grid__1_AutoML_20210517_094305_model_4 | 5289.884 | 72.73159 | 5289.884 | 15.41215 | NA |
| GLM_1_AutoML_20210517_094305 | 6978.418 | 83.53693 | 6978.418 | 28.34823 | NA |
8.3.2 Best Model (Leader)
# Show the best model
automl_DPOL@leader
Model Details:
==============
H2ORegressionModel: gbm
Model ID: GBM_grid__1_AutoML_20210517_094305_model_7
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 43 43 10607 4
max_depth mean_depth min_leaves max_leaves mean_leaves
1 4 4.00000 9 16 13.79070
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 742.9795
RMSE: 27.25765
MAE: 7.378285
RMSLE: NaN
Mean Residual Deviance : 742.9795
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 4022.796
RMSE: 63.42551
MAE: 15.04154
RMSLE: NaN
Mean Residual Deviance : 4022.796
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid
mae 15.045994 3.9332395 12.12232 11.87795 17.959179
mean_residual_deviance 4025.1875 2845.8782 2456.1912 2354.3364 8599.125
mse 4025.1875 2845.8782 2456.1912 2354.3364 8599.125
r2 0.5540781 0.14167649 0.5431397 0.5116641 0.44749096
residual_deviance 4025.1875 2845.8782 2456.1912 2354.3364 8599.125
rmse 60.5994 21.002983 49.559975 48.521503 92.73147
rmsle NaN 0.0 NaN NaN NaN
cv_4_valid cv_5_valid
mae 20.492651 12.777869
mean_residual_deviance 4990.901 1725.3843
mse 4990.901 1725.3843
r2 0.46939796 0.7986977
residual_deviance 4990.901 1725.3843
rmse 70.64631 41.537746
rmsle NaN NaN
8.3.3 Comparison (RMSE: Lower = Better)
d_eval_tmp <- data.frame(model = "Best Model from H2O AutoML",
RMSE_cv = automl_DPOL@leader@model$cross_validation_metrics@metrics$RMSE,
RMSE_test = h2o.rmse(h2o.performance(automl_DPOL@leader, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)
# Show the table
kable(d_eval)
| model | RMSE_cv | RMSE_test |
|---|---|---|
| GBM: Gradient Boosting Model (Baseline) | 67.97348 | 51.12142 |
| GBM: Gradient Boosting Model (Manual Settings) | 67.87922 | 51.06768 |
| Best Model from H2O AutoML | 63.42551 | 45.14810 |
8.4 Making Predictions
# Make predictions
yhat_test <- h2o.predict(automl_DPOL@leader, newdata = h_test)
# Show the table
kable(head(yhat_test, 5))
| predict |
|---|
| 35.388011 |
| 2.586987 |
| 577.323965 |
| 3.505565 |
| 4.594125 |
8.5 Explainable AI (Results in Context)
8.5.1 Global Explanations
# Show global explanations for best model from AutoML
h2o.explain(automl_DPOL@leader, newdata = h_test)
Residual Analysis
=================
> Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Variable Importance
===================
> The variable importance plot shows the relative importance of the most important variables in the model.
SHAP Summary
============
> SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Individual Conditional Expectations
===================================
> An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
8.5.2 Local Explanations
# Show local explanations
h2o.explain_row(automl_DPOL@leader, newdata = h_test, row_index = 1)
SHAP explanation
================
> SHAP explanation shows contribution of features for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function. H2O implements TreeSHAP which when the features are correlated, can increase contribution of a feature that had no influence on the prediction.
Individual Conditional Expectations
===================================
> Individual conditional expectations (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response for a given row. ICE plot is similar to partial dependence plot (PDP), PDP shows the average effect of a feature while ICE plot shows the effect for a single instance.
8.5.3 Local Contributions
# Make Predictions
predictions <- h2o.predict(automl_DPOL@leader, newdata = h_test)
# Show the table
kable(head(predictions, 5))
| predict |
|---|
| 35.388011 |
| 2.586987 |
| 577.323965 |
| 3.505565 |
| 4.594125 |
# Calculate feature contributions for each sample
contributions <- h2o.predict_contributions(automl_DPOL@leader, newdata = h_test)
# Show the table
kable(head(contributions, 5))
| Year | Region | Station | Replicate | lat | lon | depth | substrate | Season | Na2O | Magnesium | MagnesiumOxide | AluminumOxide | SiliconDioxide | Quartz | PhosphorusPentoxide | Sulphur | DDT_TOTAL | P.P.TDE | P.P.DDE | P.P.DDE.1 | HeptachlorEpoxide | Potassium | PotassiumOxide | Calcium | CalciumOxide | TitaniumDioxide | Chromium | Manganese | ManganeseOxide | XTR.Iron | Iron | IronOxide | Cobalt | Nickel | Copper | Zinc | Selenium | Strontium | Beryllium | Silver | Cadmium | Tin | TotalCarbon | OrganicCarbon | Carbon.Nitrogen_Ratio | TotalNitrogen | Mercury | Lead | Uranium | Vanadium | Arsenic | Chloride | Fluoride | Sand | Silt | Clay | Mean_grainsize | simpsonD | shannon | Chironomidae | Oligochaeta | Sphaeriidae | Diporeia | BiasTerm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5.749369 | 0.1816891 | -0.6550707 | 0.0513092 | -1.121307 | -3.587872 | 0.6589137 | -1.3055155 | -1.1573941 | -0.3894154 | -0.9148800 | 0.5581922 | 0.0810013 | -0.0429149 | 0 | 0 | 0.1436372 | -0.1850808 | -0.1730304 | -0.1873940 | 0.4948193 | -0.0341789 | 0.5752563 | -0.1505208 | 0.8201120 | 0.1439855 | -0.1001838 | 0.0267208 | 4.236263 | 0 | 0 | 2.3089592 | -0.1621301 | 1.0973696 | -0.1827467 | -0.1396517 | -0.0555104 | -1.093223 | -0.0208815 | 0.3243970 | 0.3806432 | 0.3837683 | -0.7344639 | -0.1059432 | 0.0216723 | 1.9802601 | 0.0034754 | -0.3391146 | 0.9240335 | -0.0252398 | -0.0795804 | 0.1114721 | 3.3446767 | 0.8472175 | 0.1447516 | 2.7301340 | 0.6274920 | 0.7287988 | -2.3404412 | -1.4094365 | 2.6234705 | 3.4465351 | 0.396596 | -0.0002280 | 15.93437 |
| 4.999606 | 0.4208446 | -6.8606038 | 0.6449652 | -2.148659 | -3.505474 | 10.5673618 | 3.7239215 | 0.1442628 | -0.2539046 | -0.1725378 | -0.2667851 | -0.0750485 | 0.1492278 | 0 | 0 | 0.0924014 | -0.1850808 | -0.0531525 | -0.0666508 | 1.1268554 | 0.0786915 | -4.6323361 | 0.1549840 | 1.1303115 | -0.5802891 | 0.0175319 | -1.7433980 | 1.978953 | 0 | 0 | 2.2018185 | -0.1929169 | 0.0712235 | 0.0319320 | -4.1429248 | -0.0398978 | -1.866422 | -0.0096466 | 0.6521323 | -0.2015527 | -0.8943596 | -1.8839849 | -0.0093712 | 0.0096143 | 1.1426355 | -0.0062303 | -0.1420653 | -3.0697949 | -0.1489434 | -0.9779817 | -0.0369908 | 0.4098719 | 0.7658747 | -0.0861650 | 0.8774084 | -0.7691286 | -0.1723884 | -6.5317354 | -1.7592351 | 0.8794661 | -5.2189517 | 3.085558 | -0.0002280 | 15.93437 |
| 54.240490 | 1.3683642 | 133.0601501 | -2.8597975 | 48.040829 | 18.082205 | 49.4240570 | 22.9496117 | -5.3234730 | 9.7090874 | 18.1731472 | 1.4052534 | 0.0370039 | 1.4064714 | 0 | 0 | 0.0019574 | -0.0538082 | -0.1943945 | -0.0862775 | 7.8344207 | -0.0116882 | 19.3110600 | -0.0262053 | 16.5199566 | 0.0513210 | -0.0295170 | 3.8383923 | 31.386347 | 0 | 0 | 14.5122032 | -0.0840988 | -0.1139140 | -0.1669139 | 11.9667253 | -0.0496545 | 7.111960 | -0.0140443 | -0.8022944 | 2.4959888 | 1.3647826 | 8.2488222 | 0.0065774 | 0.0096143 | -0.4122683 | -0.0044202 | -0.1656154 | 5.7755022 | -0.0164728 | 2.8159142 | 0.1114721 | 23.0695019 | 10.2007303 | 0.0533646 | 6.1467013 | 0.0001291 | -0.0084739 | 17.2453384 | 12.1234789 | 17.2187881 | -8.0120039 | 2.508808 | -0.0016405 | 15.93437 |
| 5.659294 | 0.4179673 | -4.3749590 | -0.3660498 | -2.256765 | -3.953184 | -1.5347352 | -0.8403457 | 0.1382265 | -0.3811278 | -0.8222194 | -0.1025687 | -0.0143357 | -0.0302356 | 0 | 0 | 0.0250903 | -0.0521320 | -0.1157115 | -0.1829050 | 0.6061351 | 0.0398006 | -1.0461478 | -0.0527855 | -0.4770761 | 0.0835655 | 0.0001885 | -0.1409906 | 1.601940 | 0 | 0 | 0.8931869 | 0.1266488 | -0.0090013 | -0.0215884 | -0.1440378 | -0.0694553 | -1.504640 | -0.0489762 | 0.0892410 | -0.2314712 | 0.1139337 | -1.5825820 | -0.0037396 | 0.0163746 | -0.6378350 | 0.0034754 | -0.3322367 | 0.5104063 | -0.0212102 | -0.0670387 | -0.0369908 | -0.7185428 | 0.6596766 | -0.2510566 | 0.3896292 | -0.1075576 | 0.1227657 | -0.8817295 | -0.5447686 | 0.5173434 | -0.7864236 | 0.301690 | -0.0002280 | 15.93437 |
| 5.831995 | 0.4179673 | -5.0647283 | 0.3835035 | -2.256765 | -3.956096 | -1.5347352 | -0.8403702 | 0.6334915 | -0.3811278 | -0.7845055 | -0.1025687 | -0.0143357 | -0.0302356 | 0 | 0 | 0.0250903 | -0.0521320 | -0.1157115 | -0.1829050 | 0.7071946 | 0.0398006 | -1.0832016 | -0.0527855 | -0.4770761 | 0.0728301 | 0.0001885 | -0.1409906 | 1.552421 | 0 | 0 | 0.8931869 | 0.1283323 | -0.0090013 | -0.0215884 | -0.1440378 | -0.0694553 | -1.504640 | -0.0489762 | 0.0892410 | -0.2314712 | 0.1139337 | -1.6291332 | -0.0037396 | 0.0163746 | -0.6378350 | 0.0034754 | -0.2791277 | 0.5104063 | -0.0212102 | -0.0670387 | -0.0369908 | -0.7154787 | 0.6079237 | -0.2510566 | 0.3896292 | -0.1075576 | 0.1292654 | -0.7736058 | -0.3036894 | 0.5241177 | -0.7861704 | 0.301690 | -0.0002280 | 15.93437 |
9 Quick Recap
9.1 Learning Resources
- H2O Documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
- H2O Learning Center: https://training.h2o.ai/
- Responsible Machine Learning Link
- An Introduction to Machine Learning Interpretability Link
10 Your Turn - Get Your Hands Dirty!
- Try to build models for target “DBUG”. (Hint: you can change the target and reuse most of the code above.)
- Instead of using DPOL and DBUG, try using other variables as targets and build predictive models. (Hint: change features and targets)
- Try your own datasets.
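As a starting point for the first exercise, the AutoML run above can be repeated with "DBUG" as the target (a sketch reusing the objects defined earlier — not the full solution):

```r
# Rerun AutoML with DBUG as the target; features and splits stay the same
automl_DBUG <- h2o.automl(x = features,
                          y = target_DBUG,
                          training_frame = h_train,
                          nfolds = 5,
                          max_models = 30,
                          stopping_metric = "RMSE",
                          exclude_algos = c("DeepLearning", "XGBoost"),
                          seed = n_seed)
# Evaluate the leader on the held-out test set
h2o.rmse(h2o.performance(automl_DBUG@leader, newdata = h_test))
```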